vision LMs are saturating benchmarks, so we built vibe eval 💬
> compare different models on refreshed in-the-wild examples across different categories 🤠
> submit your favorite model for eval
no numbers -- just vibes!
🎵 Dream come true for content creators! TIGER AI can extract voice, effects & music from ANY audio file 🤯 This lightweight model uses frequency band-split technology to separate speech like magic. Kudos to @fffiloni for the amazing demo! fffiloni/TIGER-audio-extraction
🔥 New benchmark & dataset for Subject-to-Video generation
OpenS2V-Nexus by Peking University
✨ Fine-grained evaluation for subject consistency: BestWishYsh/OpenS2V-Eval
✨ 5M-scale dataset: BestWishYsh/OpenS2V-5M
✨ New metrics: automatic scores for identity, realism, and text match
✨ Emotion-controlled, highly dynamic avatar videos
✨ Multi-character support with separate audio control
✨ Works with any style (cartoon, 3D, real face) while keeping identity consistent
emerging trend: models that can understand image + text and generate image + text
don't miss out ⤵️
> MMaDA: a single 8B diffusion model aligned with CoT (reasoning!) + UniGRPO Gen-Verse/MMaDA
> BAGEL: a 7B MoT model based on Qwen2.5, SigLIP-so-400M, and the Flux VAE ByteDance-Seed/BAGEL
both by ByteDance! 😱
Just completed the AI Agents course and wow, that capstone project really makes you understand how to build agents that can handle real-world complexity!
The final project uses the GAIA dataset - your agent has to solve tasks like analyzing Excel files, processing audio recordings, answering questions about YouTube videos, and diving into research papers. These aren't toy examples; they're the messy, multimodal tasks agents need to handle in practice.
Whether you’re just getting started with agents or want to go deeper with tools like LangChain, LlamaIndex, and SmolAgents, this course has tons of useful stuff. A few key insights:
- Code agents are incredibly versatile once you get the architecture right (see the sketch below)
- The sweet spot is finding the right balance of guidance vs. autonomy for each use case
- Once the logic clicks, the possibilities really are endless: it's like letting LLMs break free from the chatbox
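For a taste of how simple the starting point is, here's a minimal code-agent sketch with smolagents (class names are from the early-2025 releases and may have been renamed since, e.g. HfApiModel to InferenceClientModel; treat the exact imports as assumptions):

```python
# Minimal sketch of a code agent with smolagents; API names may have changed
# in newer releases, so treat the exact imports as assumptions.
from smolagents import CodeAgent, DuckDuckGoSearchTool, HfApiModel

# The agent writes and runs Python snippets to solve the task, calling the
# search tool whenever it needs outside information.
agent = CodeAgent(
    tools=[DuckDuckGoSearchTool()],
    model=HfApiModel(),  # defaults to a hosted instruct model on the Hub
)

agent.run("How many seconds would it take for a leopard at full speed to run through Pont des Arts?")
```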
The course is free and the certification deadline is July 1st, 2025.
A new research paper from KAIST builds on smolagents to push the boundaries of distillation 🥳 ➡️ "Distilling LLM Agent into Small Models with Retrieval and Code Tools" shows that, when distilling reasoning capability from a strong LLM ("teacher") into a smaller one ("student"), it's much better to use agent traces than CoT traces.
The advantages:
1. Improved generalization. Intuitively, this is because the agent can encounter more "surprising" results by interacting with its environment: for example, a web search called by the LLM teacher in agent mode can surface results that the teacher would not have generated in CoT.
2. Reduced hallucinations. The trace won't hallucinate tool call outputs!
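To make the contrast concrete, here's a hypothetical illustration of a CoT trace versus an agent trace as training samples (field names are illustrative, not the paper's schema):

```python
# Hypothetical illustration of the two kinds of teacher traces (field names
# are ours for illustration, not the paper's exact schema).

# CoT trace: the teacher reasons only from its own parameters, so the
# intermediate "facts" and the final answer can be hallucinated.
cot_sample = {
    "question": "Which team won the 2022-23 Serie A, and when was their previous title?",
    "trace": "Let's think step by step... Napoli won in 2023; their previous title was in 1990.",
    "answer": "Napoli; previous title in 1990",
}

# Agent trace: the teacher calls tools (retrieval, code), so intermediate
# observations come from the environment instead of being generated. The
# student sees results the teacher alone might never have produced, and the
# tool outputs in the trace cannot be hallucinated.
agent_sample = {
    "question": "Which team won the 2022-23 Serie A, and when was their previous title?",
    "trace": [
        {"thought": "Look up the 2022-23 Serie A champion.",
         "tool": "web_search", "input": "2022-23 Serie A winner",
         "observation": "SSC Napoli won the 2022-23 Serie A title."},
        {"thought": "Find Napoli's previous league title.",
         "tool": "web_search", "input": "Napoli previous Serie A title before 2023",
         "observation": "Napoli's previous Serie A title was in 1989-90."},
    ],
    "answer": "Napoli; previous title in 1989-90",
}
```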
multimodal 💬🖼️
> new moondream (VLM) is out: a 4-bit quantized (QAT) version of moondream-2b that runs on 2.5GB VRAM at 184 tps with only a 0.6% drop in accuracy (OS) 🌚
> ByteDance released BAGEL-7B, an omni model that understands and generates both image + text; they also released Dolphin, a document parsing VLM 🐬 (OS)
> Google DeepMind dropped MedGemma at I/O, a VLM that can interpret medical scans, plus Gemma 3n, an omni model with competitive LLM performance
> MMaDA is a new 8B diffusion language model that can generate both image and text
LLMs
> Mistral released Devstral, a 24B coding assistant (OS) 👩🏻💻
> Fairy R1-32B is a new reasoning model -- a distilled version of DeepSeek-R1-Distill-Qwen-32B (OS)
> NVIDIA released ACEReason-Nemotron-14B, a new 14B math and code reasoning model
> sarvam-m is a new Indic LM with a hybrid thinking mode, based on Mistral Small (OS)
> samhitika-0.0.1 is a new Sanskrit corpus (BookCorpus translated with Gemma3-27B)
image generation 🎨
> MTVCrafter is a new human motion animation generator
Two lines in your terminal and you have an AI agent running whatever model and tools you want 🤯
Just tried the new Tiny Agents in Python. Asked it which team won the Italian Serie A soccer league and to export the final table to CSV. Coolest thing is you can interact with the agent, guide it, and correct its mistakes.
The agent connected to web browsing tools, searched for Serie A standings, identified the champion, and generated a CSV export.
The setup:
pip install "huggingface_hub[mcp]>=0.32.0"
tiny-agents run
That's it. The MCP protocol handles all the tool integrations automatically - no custom APIs to write, no complex setups. Want file system access? It's already there. Need web browsing? Built in.
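Those built-in tools come from MCP servers declared in the agent's config. As a rough sketch (the field names below are assumptions from memory, not a guaranteed schema; check the huggingface_hub / tiny-agents docs for the exact format), a config declares the model, the inference provider, and the MCP servers that expose tools:

```python
# Rough sketch of an agent config, written as a Python dict and saved as the
# JSON file the CLI reads. Field names are assumptions from memory -- check
# the huggingface_hub / tiny-agents docs for the exact schema.
import json

agent_config = {
    "model": "Qwen/Qwen2.5-72B-Instruct",  # any tool-calling chat model
    "provider": "nebius",                  # swap inference providers here
    "servers": [
        {   # an MCP server launched as a local process, e.g. for web browsing
            "type": "stdio",
            "command": "npx",
            "args": ["@playwright/mcp@latest"],
        },
    ],
}

with open("agent.json", "w") as f:
    json.dump(agent_config, f, indent=2)
```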
You can swap models, change inference providers, run local models, or add new tools just by editing a simple JSON config. You can also use Gradio Spaces as MCP servers! The entire agent is ~70 lines of Python - essentially a while loop that streams responses and executes tools. Everything is open-source. ❤️ Hugging Face
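Here's a schematic of that while loop, as a sketch rather than the real implementation (call_mcp_tool is a hypothetical stand-in for routing tool calls to MCP servers, and the real code also handles streaming, provider selection, and message serialization):

```python
# Schematic of the core agent loop -- not the actual tiny-agents code.
# call_mcp_tool is a hypothetical stand-in; details are glossed over.
from huggingface_hub import InferenceClient

def call_mcp_tool(tool_call) -> str:
    """Hypothetical helper: execute one tool call via its MCP server."""
    raise NotImplementedError

def run_agent(prompt: str, model: str = "Qwen/Qwen2.5-72B-Instruct", tools=None):
    client = InferenceClient(model)
    messages = [{"role": "user", "content": prompt}]
    while True:
        reply = client.chat_completion(messages=messages, tools=tools)
        msg = reply.choices[0].message
        if not msg.tool_calls:          # model answered directly: we're done
            return msg.content
        # Record the tool request, execute it, and feed the result back
        messages.append({"role": "assistant", "tool_calls": msg.tool_calls})
        for call in msg.tool_calls:
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": call_mcp_tool(call),
            })
```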